In this notebook, we explore the application of Independent Component Analysis (ICA) to the top 28 S&P 500 stock returns from November 2011 to November 2016. We provide the data at https://github.com/Freshdachs/FinData .
import numpy as np
import pandas as pd
import datetime
import plotly.offline as py
import plotly.graph_objs as go
from sklearn.decomposition import FastICA, PCA
from itertools import islice
import functools
py.init_notebook_mode(connected=True)
# plot each labelled series as an overlaid line chart
plot = lambda d, lbls: py.iplot([go.Scatter(x=d.index, y=d[i], name=i) for i in lbls])
# plot each labelled series with a vertical offset of 2*i so the curves do not overlap
plot_off = lambda d, lbls: py.iplot([go.Scatter(x=d.index, y=d[l] + 2 * i, name=l) for (i, l) in enumerate(lbls)])
#load timeseries
df=pd.read_csv('tss3.csv', sep=';',index_col=0,parse_dates=True,decimal=',')
#sort columns by sum
df = df[df.columns[np.argsort(df.apply(sum).values)][::-1]]
We use the daily adjusted closing prices of 441 S&P 500 stocks over 2458 days, from 2011-11-01 to 2016-11-01. We retrieved the data from Quandl and cleaned it up so that only stocks and days with complete data remain.
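The cleaning step itself is not shown in the notebook; a minimal sketch of how it could look, assuming a hypothetical 50% missing-data cutoff per ticker and synthetic ticker names, is:

```python
import numpy as np
import pandas as pd

# Hypothetical raw panel: 5 days x 3 tickers, with some gaps.
raw = pd.DataFrame(
    {"AAA": [1.0, 1.1, np.nan, 1.2, 1.3],
     "BBB": [2.0, 2.1, 2.2, 2.3, 2.4],
     "CCC": [np.nan, 3.1, 3.2, 3.3, 3.4]},
    index=pd.date_range("2011-11-01", periods=5),
)

# Drop tickers with too many gaps, then drop the remaining incomplete days,
# so only (stock, day) combinations with complete data survive.
keep = raw.columns[raw.isna().mean() < 0.5]   # hypothetical 50% cutoff
clean = raw[keep].dropna(axis=0)
```

After this, `clean` contains no missing values; here both gappy days are dropped, leaving a 3x3 complete panel.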
def pre_diff(df):
    # first differences; also return the first row so the series can be rebuilt
    d = df.copy()
    d.values[1:] = df.values[1:] - df.values[:-1]
    return (d.drop([d.index[0]]), df[0:1])
def pre_mean(df):
    # subtract each column's mean; also return the means for reconstruction
    d = df.copy()
    return (d.apply(lambda i: i - np.mean(i)), d.apply(np.mean))
def pre_normalize(df):
    # scale by the global absolute maximum; also return the factor
    d = df.copy()
    fac = np.max(np.max(np.abs(d)))
    return (d.apply(lambda i: i / fac), fac)
    # alternative normalizations we experimented with:
    #fac = d.apply(lambda d: np.max(np.abs(d)))
    #fac = d.apply(np.var)
    #return (d/fac, fac.values)
def pre_process(df):
    # difference, demean and normalize; keep everything needed to invert the steps
    d, base = pre_diff(df)
    d, means = pre_mean(d)
    d, fac = pre_normalize(d)
    return (d, base, means, fac)
pre_df, base,means,fac = pre_process(df)
idcs = np.argsort(df.apply(max).values)
lbls = df.columns[:8].values
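As a sanity check, all three preprocessing steps (difference, demean, normalize) are invertible. A small self-contained roundtrip on synthetic price paths, mirroring what `pre_process` and the later `reconstruct` do, looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.random((10, 3)).cumsum(axis=0) + 5   # hypothetical price paths

# forward: difference, demean, normalize (mirrors pre_process)
diffs = np.diff(prices, axis=0)
means = diffs.mean(axis=0)
centered = diffs - means
fac = np.abs(centered).max()
scaled = centered / fac

# inverse: undo each step in reverse order, then integrate from the first row
recovered = np.vstack([prices[:1],
                       prices[:1] + np.cumsum(scaled * fac + means, axis=0)])
assert np.allclose(recovered, prices)
```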
Below you can find a visualization of the top 8 stock returns.
plot(df,lbls)
Below we show the preprocessed series (differenced, demeaned and normalized), vertically offset for readability, for comparison with the regular price series above.
plot_off(pre_df,lbls)
In this section, we apply the ICA approach to our data.
# set up ICA; whiten=False since the data was already centred and scaled in pre_process
ica = FastICA(whiten=False, max_iter=1000)
#reduce data size: keep only the 28 largest stocks
d, base, means, fac = pre_process(df[df.columns[:28]])
# compute sources S, the mixing matrix, and the inverse-transformed signal from dataframe d
def ica_d(d):
    fit = ica.fit(d)
    S = fit.transform(d)
    return (S, fit.mixing_, fit.inverse_transform(S))
S,A,x = ica_d(d)
def reconstruct(df, x, base, means, fac):
    # invert pre_process: rescale, re-add the means, integrate, and prepend the first row
    d = df.copy()
    x = x * fac + means.values
    d.values[:] = (np.cumsum(x, 0) + base.values)[:]
    return base.append(d)
def ranking(S, A):
    # for each stock (row of A), rank the components by the peak magnitude
    # of their weighted contribution |S * row|
    return np.apply_along_axis(lambda row: np.argsort(np.apply_along_axis(max, 0, np.abs(S * row))), 1, A)
def indices(R, i=0, n=8):
    # split stock i's ranking into its n strongest components and the rest
    return (R[i][-n:], R[i][:-n])
def sep_ics(S, A, R, i=0, n=8):
    # reconstruct the signal separately from the major and the minor components
    return list(np.dot(S[:, f], A[:, f].T) for f in indices(R, i, n))
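Since the mixed signal X = S·Aᵀ is linear in the components, splitting the component indices into two disjoint groups and summing the two partial reconstructions recovers the full signal exactly. A quick sketch with random matrices (a hypothetical 6-component setup, not the notebook's fitted model):

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.standard_normal((50, 6))   # sources (samples x components)
A = rng.standard_normal((6, 6))    # mixing matrix, so that X = S @ A.T

maj, minr = [0, 1, 2], [3, 4, 5]   # hypothetical split of component indices
X_maj = S[:, maj] @ A[:, maj].T
X_min = S[:, minr] @ A[:, minr].T

# linearity: the two partial reconstructions add up to the full one
assert np.allclose(X_maj + X_min, S @ A.T)
```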
def to_df(d, v):
    # wrap an array in a dataframe with d's index and columns
    e = d.copy()
    e.values[:] = v[:]
    return e
R = ranking(S, A)
plot_off(to_df(d, S), d.columns[indices(R)[0]].values)
X_maj, X_min =sep_ics(S,A,R)
rec_df = reconstruct(d,X_maj,base,means,fac)
min_df = reconstruct(d,X_min,base,means,fac)
plot(rec_df,lbls)
plot(min_df,lbls)
plot(df,lbls)
Here we see that both the minor and the major components can be reconstructed to similar approximations. We see upward trends in both reconstructions; how can that be? We subtract the mean from our data during preprocessing, so even a zero signal gets the mean added back during reconstruction. Since we reconstruct the derivative (the daily differences), this re-added mean translates into a constant slope.
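This can be checked directly: reconstructing an all-zero signal with hypothetical per-stock means re-adds the mean at every step, so the cumulative sum is a straight line with slope equal to the mean daily return.

```python
import numpy as np

means = np.array([0.01, -0.02])      # hypothetical per-stock mean daily returns
zero_signal = np.zeros((100, 2))     # a "reconstruction" carrying no information
base = np.array([50.0, 80.0])        # first-day prices

# undoing the demeaning step re-adds `means` to every differenced row,
# so the integrated series grows with a constant slope per stock
rec = base + np.cumsum(zero_signal + means, axis=0)
slopes = np.diff(rec, axis=0)
assert np.allclose(slopes, means)    # every daily step equals the mean
```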
This shows that we can successfully perform an ICA decomposition and back again.
Now we want to take a look at the thresholded reconstruction.
# reconstruct whitened X from S and A, applying activation function g elementwise
array_with_g = lambda g, S, A: np.array([[np.sum(np.vectorize(g)(row_S * row_A)) for row_A in A] for row_S in S])
activation     = lambda e: lambda i: i if (i * i) > (e * e) else 0   # keep values above threshold e
activation_tst = lambda e: lambda i: 1 if (i * i) > (e * e) else 0   # indicator: counts kept values
activation_inv = lambda e: lambda i: 0 if (i * i) > (e * e) else i   # keep values below threshold e
def sep_ics_g(S, A, e):
    # threshold-split the reconstruction and report the fraction of kept weights
    return (array_with_g(activation(e), S, A),
            array_with_g(activation_inv(e), S, A),
            np.mean(array_with_g(activation_tst(e), S, A)) / A.shape[0])
S,A,x = ica_d(d)
X_maj_t, X_min_t, rat = sep_ics_g(S,A,0.01)
rat
df_maj, df_min = (reconstruct(d,x,base,means,fac) for x in (X_maj_t,X_min_t))
[plot(x, lbls) for x in (df_min, df_maj, df)]
We can see that the thresholded reconstruction, with a threshold of 0.01 and a keeping rate of roughly 0.2, looks mostly fine but has serious issues with certain stocks (CMG), which drop off sharply. Perhaps at these points many small weights are spread across the components, weights which would usually cancel out a stronger signal?
We can see that the global-maximum normalization is very good at reconstructing Priceline (the biggest stock) and not much else. However, even with a tiny keeping ratio, we still see some information in the smaller stocks. Since these are the 8 biggest stocks, they might also be the most resilient ones.
In this section, we analyze what results a PCA approach yields.
pca = PCA()
# compute sources S, the mixing matrix, and the inverse-transformed signal from dataframe d
def pca_d(d):
    fit = pca.fit(d)
    S = fit.transform(d)
    return (S, fit.components_.T, fit.inverse_transform(S))
S,A,x = pca_d(d)
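With all components kept, PCA's inverse transform is exact: X = S·components_ + mean_. A quick self-contained check with random data (a hypothetical 5-feature panel, not the notebook's stock data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 5))   # hypothetical preprocessed returns

pca = PCA()
S = pca.fit_transform(X)

# keeping all components, the inverse transform recovers X exactly
assert np.allclose(pca.inverse_transform(S), X)
# inverse_transform is just the linear map back plus the fitted mean
assert np.allclose(S @ pca.components_ + pca.mean_, X)
```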
R = ranking(S,A)
X_maj, X_min =sep_ics(S,A,R)
rec_df, min_df = (reconstruct(d,x,base,means,fac) for x in (X_maj,X_min))
[plot(x, lbls) for x in (rec_df, min_df, df)]
S,A,x = pca_d(d)
X_maj_tp, X_min_tp, rat =sep_ics_g(S,A,0.01)
rec_df, min_df = (reconstruct(d,x,base,means,fac) for x in (X_maj_tp, X_min_tp))
[plot(x, lbls) for x in (rec_df, min_df, df)]
rat
We can see that PCA does rather well at reconstructing with a threshold of 0.01 and a keeping ratio below 2%.
The variance normalization does not work all that well: we have serious issues with the reconstruction, even at a keeping rate of 33%.
The local-maximum normalization also works fairly well, yielding fairly clean reconstructions, though with a mysterious drop-off in CMG at a keeping rate of 21%.
plot_off(to_df(d,S),d.columns[indices(R)[0]].values)
Random collection of code snippets + more.
# indices of the n source columns with the largest peak magnitude
topsources = lambda S, n: np.argsort(np.apply_along_axis(lambda x: np.max(np.abs(x)), 0, S))[-n:]
topsources(S,8)
#show top source columns
py.iplot([go.Scatter(y=S[:, i]) for i in topsources(S, 8)])
np.argsort( np.apply_along_axis(np.max,0,S))